In this project we will explore data from Prosper loans that contains information on 113937 loans made through Prosper between 2005 and 2014. For the purposes of this project we will imagine that we are a competitor to Prosper and will attempt to use these data to predict the interest rate that a potential borrower would be offered by Prosper so that we can offer a slightly cheaper rate.
According to Wikipedia, Prosper is America’s first peer-to-peer lender in which borrowers request funds and other individuals and institutions can fund such loan requests. Although Prosper loans were initially auctioned, since July 2009 rates are set by Prosper’s model and lenders can choose whether or not they wish to fund the loan. Prosper determines its rates using a traditional credit score and its own proprietary Prosper score based on their own historical data.
Now let’s look at some of our variables. Our dependent variable is BorrowerRate. We are trying to predict the rate that a borrower would be offered on the Prosper website. Also, since Prosper changed how it set the rates for its loans we will create a NewRegime variable that is TRUE if the loan has a ProsperRating, the first of which occurs on July 20, 2009.
## [1] "Summary of Borrower Rate"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1334 0.1830 0.1924 0.2500 0.4975
The data appear to be more normally distributed when square root transformed. The older data contain a number of very low interest rates (including 0) which are not found in the new data. In either event we note a spike in rates at around 35%. We need to investigate this spike since the presence of these outlier values will surely affect the predictions of our linear model.
Below we plot a daily count of all loans greater at rates above 32%. It appears to show spikes and crashes, the first of which seems to coincide with the financial crisis of 2008.
Let’s look more closely at the loans in the histogram spike. Are they at very specific rates or are there just a lot of loans in that range?
Looks like there are a lot of loans at two specific rates. Hard to see exactly what they are in the graph, so we print counts for all of the rates in that area.
## [1] 0.3175
## [1] 8
## [1] 0.3176
## [1] 2
## [1] 0.3177
## [1] 3332
## [1] 0.3178
## [1] 1
## [1] 0.3179
## [1] 2
## [1] 0.318
## [1] 1
## [1] 0.3181
## [1] 0
## [1] 0.3182
## [1] 0
## [1] 0.3183
## [1] 0
## [1] 0.3184
## [1] 0
## [1] 0.3185
## [1] 27
## [1] 0.3186
## [1] 0
## [1] 0.3187
## [1] 0
## [1] 0.3188
## [1] 4
## [1] 0.3189
## [1] 2
## [1] 0.319
## [1] 4
## [1] 0.3191
## [1] 0
## [1] 0.3192
## [1] 0
## [1] 0.3193
## [1] 0
## [1] 0.3194
## [1] 0
## [1] 0.3195
## [1] 10
## [1] 0.3196
## [1] 0
## [1] 0.3197
## [1] 1
## [1] 0.3198
## [1] 3
## [1] 0.3199
## [1] 1649
## [1] 0.32
## [1] 76
We find 1651 loans with an interest rate of exactly 31.99% and 3672 loans with an interest rate of 31.77%. We will look more closely at these loans to see if we can figure out why there are so many at these rates.
Although there is some overlap, it appears that the 31.99% rates were offered in 2011 and the 31.77% in 2012. Is there an equivalent maximum for loans after that date that we have missed? Let’s get a histogram of rates for loans in 2013 and after.
## Warning: position_stack requires constant width: output may be incorrect
## [1] 0.313
## [1] 0
## [1] 0.3131
## [1] 0
## [1] 0.3132
## [1] 0
## [1] 0.3133
## [1] 0
## [1] 0.3134
## [1] 718
## [1] 0.3135
## [1] 0
## [1] 0.3136
## [1] 0
## [1] 0.3137
## [1] 0
## [1] 0.3138
## [1] 0
## [1] 0.3139
## [1] 0
## [1] 0.314
## [1] 0
For loans 2013 and beyond we find that there are many loans (718) at 31.34%.
We now expect to find a similar rate ceiling in the data between the start of the new rate setting method in December 2009 and the beginning of 2011 when the ceiling appears to have become 31.99%.
## Warning: position_stack requires constant width: output may be incorrect
## [1] 0.3495
## [1] 15
## [1] 0.3496
## [1] 0
## [1] 0.3497
## [1] 0
## [1] 0.3498
## [1] 3
## [1] 0.3499
## [1] 8
## [1] 0.35
## [1] 800
## [1] 0.3501
## [1] 0
## [1] 0.3502
## [1] 0
For loans between December 2009 and January 2011 we find a spike in the histogram of rates at 35.00% with 800 loans.
We create a dataset containing only those loans where BorrowerRate is equal to one of the ceilings we have identified above, and then plot counts of those loans across time. We see that those rates in fact appear to be very limited to specific time windows.
Now we create a new variable IsCeiling for variables in the new regime. It is a Boolean that indicates if the loan was made at the prevailing ceiling rate that we have identified. Note that only a fraction of total loans were made at this ceiling rate (7.6%).
So it appears that the spike we see in the histogram of all rates around 34% is due to a cap on the rates that Prosper offers to its borrowers (or at least most of them). That cap changes about once a year, but it remains unclear if it is due to regulatory factors or internal Prosper decisions. According to this article up until April 15 2008 the maximum rate Prosper could offer a lender was determined by regulations in the borrowers state, but after that point there was a cap of 36% for all states except Texas. This suggests that the rate ceilings we are finding in the data are set internally by Prosper.
Since we are ultimately interested in predicting rates offered by Prosper, from this point forward we will mainly be interested in loans made since December 19 2009 under the new rate setting mechanism. We subset the data accordingly before analyzing independent variables.
Now we look at our independent variables, many of which are ranked. We expect that any factor that increases the perceived probabilty of loan default will correlate with higher interest rates.
LoanOriginalAmount## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7500 9114 14000 35000
Here we see that most loans are multiples of $5000. The bulk of the loans appear to be in the $5000 range with progressively fewer of the larger loans. A square root transformation of the data yield a more normal distribution but it is still lumpy.
## 12 36 60
## 1561 57664 24070
IncomeRange - we expect that individuals with higher incomes are likely to pay lower rates. Most of the borrowers have yearly incomes of at least $25000.## $0 $1-24,999 $25,000-49,999 $50,000-74,999 $75,000-99,999
## 656 4515 23726 25212 14255
## $100,000+
## 14931
CreditRating - this number is obtained from a traditional credit report. The untransformed data is positively skewed, but looks more normal when square root transformed.## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 700.0 699.4 720.0 880.0
## Warning: position_stack requires constant width: output may be incorrect
IsBorrowerHomeowner - This variable simply indicates whether or not the borrower owns a home. A small majority of borrowers are homeowners.## False True
## 39291 44004
ListingCategory - This indicates the stated purpose of the loan. Most lenders are seeking to consolidate other loans. This is even true for those that are taking high interest loans which seems odd since consolidation loans are usually for the purpose of lowering rates.The dataset includes information from 110079 loans measured in 31 variables. We have continuous variables as well as categories and ranked categories.
As noted, the main feature of interest is the BorrowerRate. We would like to model this rate so that we can guess at the rates a borrower would receive from Prosper.
We will investigate LoanOriginationAmount', 'Term', 'CreditScore', 'IncomeRange', 'IsBorrowerHomeowner',ListingCategory’ and Month. We also add data on 3-month T-bill rates (TB3MS) and credit card rates (CCRate) obtained from the FederalReserve.
We created a Month variable to determine if interest rates changed according to the time of the year. We reasoned that there may be times of the year when borrowers are more desperate and willing to accept higher rates.
We also created a NewRegime variable, a Boolean indicating whether or not the loan was offered after the change in Prosper’s methodology. We treat any loan with a ProsperRating as part of the new regime.
As noted above, we created an IsCeiling Boolean that indicates if the loan was made at the prevailing Prosper rate ceiling.
We also created two variables from data obtained outside the dataset. Both to indicate prevailing interest rates at the time of the loan. They are TB3MS to indicate the prevailing interest rates on US 3-month Treasury Bills and CCRate to indicate the prevailing credit card interest rates at the time of the loans. Both these datasets were obtained from the Federal Reserve.
BorrowerRate, LoanOriginalAmount and CreditScore all had long-tailed distributions and so were square root transformed so that the distribution would be more normal. The data were adjusted so that dates would be treated as such by R. We also made sure that variables were of the appropriate data type and that ranked variables were ordered correctly.
The BorrowerRate variable was of particular concern due to a spike in loans at around 35%. This spike represents a challenge for linear modeling of the data. Our investigation suggests that the spike is due to a Prosper internal rate ceiling.
We now want to look at the relationship between our dependent variable (BorrowerRate) and the dependent variables we have identified above.
CreditScoreThis scatter plot does seem to show the expected negative correlation between CreditScore and BorrowerRate, though we also note that the data are very noisy. A Pearson’s test of reveals a negative correlation between CreditScore and BorrowerRate (r = -0.5309) and so is a good candidate for inclusion in our model.
IncomeRange
we would expect that borrowers with higher incomes are less risky and could therefore get better rates. Using Pearson’s test, we calculate a negative correlation between IncomeRange and BorrowerRate (= -0.251). So IncomeRange is also likely to be a good predictor variable for our model.
DebtToIncomeRatio
## Warning: Removed 7107 rows containing missing values (geom_point).
## [1] 0.1270731
As expected, DebtToIncomeRatio is positively correlated with BorrowerRate, but the relationship is not particularly strong(r = 0.1271)
Term## ldNew$Term: 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.0929 0.1434 0.1508 0.2064 0.2669
## --------------------------------------------------------
## ldNew$Term: 36
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0423 0.1274 0.1899 0.1980 0.2699 0.3600
## --------------------------------------------------------
## ldNew$Term: 60
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0669 0.1486 0.1852 0.1923 0.2310 0.3304
It appears that rates for the 12-month loans are slightly better that for those with longer terms. However, there appears to be very little correlation between BorrowerRate and Term (r = 0.0277).
LoanOriginalAmount -We find a negative correlation between LoanOriginalAmount and BorrowerRate (r = -0.402). This may be due to the fact that the administrative costs of servicing, likely to be relatively uniform for all loans, make up a larger portion of the cost of a loan for small loans.
IsBorrowerHomeowner -We find a significant difference between rates offered to homeowners compared to non-homeowners.
ListingCategoryHere we see that loans for which the listing category is “Not Available” tend to have higher rates. This is unsurprising, but it does seem odd that loans for category “Other” are not similarly penalized. Loans for “Baby & Adoption” seem to have relatively low rates, most likely because only more worthy borrowers are seeking loans of this type. It also appears that numerous individuals have taken very high interest loans in order to purchase engagement rings. An ANOVA reveals that there are significant differences in the loans offered for different categories (F = 106.6026, p < 2e-16).
We would expect that the rates on offer on Prosper would correlate with prevailing interest rates for other types of loans. Since most of the loans are for a period of 36 months, we will add a variable to our dataset that indicates the prevailng interest rate for these bonds at the time the Prosper loan was made.
## [1] "Correlation between BorrowerRate and TB3MS = 0.0817"
## [1] "Correlation between BorrowerRate and CCRate = 0.0333"
In both cases we find only tiny positive correlations between BorrowerRate and the other rates. We also note that in most cases the Prosper rates are much higher than prevailing credit card interest rates. This suggests that people who borrow on Prosper are unable to obtain funding through more traditional channels.
It may be that rates on Prosper are higher or lower at different times of the year. To check that we create boxplots of BorrowerRate showing the rates in each of the different months of the year.
It does appear to be the case that rates are higher in the summer months (F = 227.6630385, p < 2e-16). It also seems to the be the case that in the beginning of the year BorrowerRate is negaitively skewed and at the end of the year it is positively skewed with a relatively small number of high interest loans that bring up the average.
As noted above we find negative correlations between BorrowerRate and CreditScore, IncomeRange and LoanOriginalAmount. We fail to find expected correlation between BorrowerRate and CCRate. We find the length of a loan to be a poor predictor of the rate for that loan.
It was surprising to see that CreditScore and IncomeRange did not correlate very strongly. (r = 0.1461226)
Unsuprisingly, the strongest relationship that we found was between BorrowerRate and CreditScore.
At each income range, homeowners tend to pay lower rates.
Those that are unable to provide proof of their stated income pay higher rates. This appears to be true even for those that have no income, but the divergence is especially significant at the higher income ranges.
The lowest rates are payed by homeowners with good credit.
The lowest rates go to high income individuals with good credit.
We find that IncomeRange is a more powerful predictor when IncomeVerifiable is TRUE.
We did not find any. The most surprising discoveries were that seeming lack of correlation between CCRate and BorrowerRate
##
## Calls:
## m5: lm(formula = sqrt(BorrowerRate) ~ sqrt(CreditScore) + IncomeRange +
## IsBorrowerHomeowner + Term + sqrt(LoanOriginalAmount), data = ldNew)
## m6: lm(formula = sqrt(BorrowerRate) ~ sqrt(CreditScore) + IncomeRange +
## IsBorrowerHomeowner + Term + sqrt(LoanOriginalAmount) + IncomeVerifiable,
## data = ldNew)
## m7: lm(formula = sqrt(BorrowerRate) ~ sqrt(CreditScore) + IncomeRange +
## IsBorrowerHomeowner + Term + sqrt(LoanOriginalAmount) + IncomeVerifiable +
## DebtToIncomeRatio, data = ldNew)
## m8: lm(formula = sqrt(BorrowerRate) ~ sqrt(CreditScore) + IncomeRange +
## IsBorrowerHomeowner + Term + sqrt(LoanOriginalAmount) + IncomeVerifiable +
## DebtToIncomeRatio + ListingCategory, data = ldNew)
## m9: lm(formula = sqrt(BorrowerRate) ~ sqrt(CreditScore) + IncomeRange +
## IsBorrowerHomeowner + Term + sqrt(LoanOriginalAmount) + IncomeVerifiable +
## DebtToIncomeRatio + ListingCategory + Month, data = ldNew)
##
## ============================================================================================
## m5 m6 m7 m8 m9
## --------------------------------------------------------------------------------------------
## (Intercept) 1.735*** 1.749*** 1.661*** 1.704*** 1.695***
## (0.008) (0.008) (0.067) (0.068) (0.067)
## sqrt(CreditScore) -0.045*** -0.045*** -0.046*** -0.048*** -0.048***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## IncomeRange: $1-24,999/$0 -0.042*** -0.020*** 0.048 0.045 0.039
## (0.003) (0.003) (0.066) (0.065) (0.064)
## IncomeRange: $25,000-49,999/$0 -0.060*** -0.036*** 0.038 0.035 0.030
## (0.003) (0.003) (0.066) (0.065) (0.064)
## IncomeRange: $50,000-74,999/$0 -0.070*** -0.046*** 0.029 0.026 0.020
## (0.003) (0.003) (0.066) (0.065) (0.064)
## IncomeRange: $75,000-99,999/$0 -0.072*** -0.047*** 0.028 0.025 0.018
## (0.003) (0.003) (0.066) (0.065) (0.064)
## IncomeRange: $100,000+/$0 -0.071*** -0.047*** 0.029 0.025 0.018
## (0.003) (0.003) (0.066) (0.065) (0.064)
## IsBorrowerHomeowner: True/False 0.011*** 0.012*** 0.012*** 0.012*** 0.012***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## Term: .L 0.060*** 0.060*** 0.062*** 0.063*** 0.060***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## Term: .Q -0.013*** -0.013*** -0.013*** -0.014*** -0.014***
## (0.001) (0.001) (0.001) (0.001) (0.001)
## sqrt(LoanOriginalAmount) -0.001*** -0.001*** -0.001*** -0.001*** -0.001***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## IncomeVerifiable: True/False -0.027*** 0.007 0.007 0.018
## (0.001) (0.009) (0.009) (0.009)
## DebtToIncomeRatio 0.028*** 0.029*** 0.029***
## (0.001) (0.001) (0.001)
## ListingCategory: 1 -0.023 -0.020
## (0.015) (0.015)
## ListingCategory: 2 -0.008 -0.007
## (0.015) (0.015)
## ListingCategory: 3 0.001 0.003
## (0.015) (0.015)
## ListingCategory: 5 -0.034* -0.032*
## (0.016) (0.015)
## ListingCategory: 6 -0.020 -0.018
## (0.015) (0.015)
## ListingCategory: 7 -0.007 -0.004
## (0.015) (0.015)
## ListingCategory: 8 -0.020 -0.015
## (0.016) (0.015)
## ListingCategory: 9 -0.026 -0.025
## (0.017) (0.016)
## ListingCategory: 10 0.012 0.014
## (0.017) (0.016)
## ListingCategory: 11 -0.016 -0.013
## (0.016) (0.015)
## ListingCategory: 12 -0.012 -0.007
## (0.018) (0.018)
## ListingCategory: 13 -0.007 -0.005
## (0.015) (0.015)
## ListingCategory: 14 -0.019 -0.015
## (0.015) (0.015)
## ListingCategory: 15 -0.017 -0.014
## (0.015) (0.015)
## ListingCategory: 16 -0.022 -0.020
## (0.015) (0.015)
## ListingCategory: 17 -0.013 -0.008
## (0.018) (0.017)
## ListingCategory: 18 -0.013 -0.009
## (0.015) (0.015)
## ListingCategory: 19 -0.017 -0.015
## (0.015) (0.015)
## ListingCategory: 20 -0.005 -0.003
## (0.015) (0.015)
## Month: February/January -0.001
## (0.001)
## Month: March/January 0.012***
## (0.001)
## Month: April/January 0.028***
## (0.001)
## Month: May/January 0.030***
## (0.001)
## Month: June/January 0.031***
## (0.001)
## Month: July/January 0.033***
## (0.001)
## Month: August/January 0.031***
## (0.001)
## Month: September/January 0.028***
## (0.001)
## Month: October/January 0.022***
## (0.001)
## Month: November/January 0.015***
## (0.001)
## Month: December/January 0.009***
## (0.001)
## --------------------------------------------------------------------------------------------
## R-squared 0.391 0.398 0.414 0.422 0.441
## adj. R-squared 0.391 0.398 0.414 0.422 0.441
## sigma 0.067 0.067 0.066 0.065 0.064
## F 5353.420 5012.823 4484.163 1793.565 1431.378
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 106617.736 107104.212 99436.644 99961.823 101248.435
## Deviance 377.007 372.629 327.925 323.435 312.694
## AIC -213211.472 -214182.423 -198845.287 -199857.647 -202408.870
## BIC -213099.511 -214061.131 -198715.914 -199552.695 -202002.268
## N 83295 83295 76188 76188 76188
## ============================================================================================
We created a linear model incorporating nine of the variables that we investigated earlier. The main strength of this method is that it uses only a small number of variables and a simple model to make its predictions. The main weakness of this method is that it does a very poor job predicting the BorrowerRate from the given data.
The plots above are intended to give a quick overview of the history of Prosper Loans. It shows steady growth in the volume of loans on Prosper since its reopening and, more recently, a steady decrease in the average interest rate offered to its borrowers suggesting that it is attracting the more reliable borrowers that it needs to survive.
This plot shows a sample of loans since Prosper reopened on July 15, 2009. It shows BorrowerRate against its 3 best predictors. We see that nearly all of the large loans are to high income borrowers, that low income borrowers tend to pay very high rates and that although average rates on Prosper may be high, some borrowers are getting very attractive rates compared to other comsumption loans.
The plot above shows the frequency of loans at different rates to borrowers with different credit scores. This plot helps to counteract the impression given by Plot 1 that Prosper borrowers will always do worse than credit card borrowers. In fact we see that many borrowers, especially those with good credit, are able to take loans at rates much more attractive than those typically offered by credit card companies.
In this project we explored the possibility of predicting the rate that Prosper Loans would offer its borrowers based on simple information available from borrowers and their credit reports. We were unable, in this preliminary investigation of the data, to develop a model that would reliably accomplish that task. Nevertheless, important information has been developed that can aid us in the future. We were able to determine for example, that Prosper appears to place an internal ceiling on the rates it offers and that that ceiling seems to change at the beginning of the year. We also found that although aggregate interest rates at Prosper are quite high compared to other comsumption loans, there are also many loans given at much more attractive rates for more qualified borrowers. A surprise finding was the lack of a link between Prosper rates and other prevailing interest rates.
The decision to subset the data to those loans that were made since Prosper reopened with a new methodology, although a sensible one, may have resulted in useful data being discarded. Similarly, the decision to model only a fraction of the available variables allowed for more in depth analysis of those variables, but also deprived the model of potentially crticial information.
Our failure to model the rates that Prosper offers to its lenders suggests several additional courses of action. For one we should spend considerably more time increasing our domain knowledge of Prosper and of consumer lending in general. Second, we should spend more time considering the many other variables that are available from Prosper to determine which of those, if any, can improve our model.